Boston and More

Boston data

library(MASS)
library(DT)
library(dplyr)
Boston <- Boston %>%
mutate(dis_rad = dis/rad)
datatable(Boston, rownames = FALSE)
caret

There is extensive documentation for the caret package at: https://topepo.github.io/caret/
train and test

set.seed(3416)
library(caret)
TRAIN <- createDataPartition(Boston$medv,
p = 0.75,
list = FALSE,
times = 1)
BostonTrain <- Boston[TRAIN, ]
BostonTest <- Boston[-TRAIN, ]
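createDataPartition() returns row indices for a stratified 75 % split. As a minimal base-R sketch of the same idea (assumption: plain random sampling, without caret's stratification on medv, so the indices will differ from TRAIN above):

```r
# Simple 75/25 random split in base R (caret's createDataPartition
# additionally stratifies on the outcome).
set.seed(3416)
n <- 506                                 # rows in the Boston data
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
length(train_idx)                        # training rows: 379
n - length(train_idx)                    # test rows: 127
```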
pp_BostonTrain <- preProcess(BostonTrain[, -14],
method = c("center", "scale", "BoxCox"))
pp_BostonTrain
Created from 381 samples and 14 variables
Pre-processing:
- Box-Cox transformation (12)
- centered (14)
- ignored (0)
- scaled (14)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.9000 -0.1250 0.2000 0.4167 0.7250 2.0000
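The lambda estimates above parameterize the Box-Cox transform, which for lambda != 0 maps x to (x^lambda - 1)/lambda, and for lambda = 0 to log(x). A minimal sketch of the transform itself (preProcess() estimates lambda by maximum likelihood and applies this internally):

```r
# Box-Cox transform for positive x; lambda = 1 only shifts x by -1,
# while lambda = 0 is the log transform.
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
box_cox(c(1, 2, 4), lambda = 1)   # 0 1 3
box_cox(c(1, 2, 4), lambda = 0)   # same as log(c(1, 2, 4))
```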
BostonTrain_pp <- predict(pp_BostonTrain, newdata = BostonTrain)
datatable(BostonTrain_pp, rownames = FALSE)
BostonTest_pp <- predict(pp_BostonTrain, newdata = BostonTest)
datatable(BostonTest_pp, rownames = FALSE)
set.seed(123)
library(caret)
myControl <- trainControl(method = "cv", number = 5)
mod_lm <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "lm")
mod_lm$results$RMSE # Training RMSE
[1] 4.532099
p <- predict(mod_lm, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.357278
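RMSE() here is caret's helper; the quantity is just the root mean squared difference between observed and predicted values, so it can be checked by hand (and is symmetric in its two arguments):

```r
# sqrt(mean((obs - pred)^2)); squaring makes the argument order irrelevant.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(c(3, 5, 7), c(2, 6, 8))   # 1
```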
set.seed(123)
library(caret)
myControl <- trainControl(method = "cv", number = 5)
mod_fs <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "leapForward")
mod_fs$results$RMSE # Training RMSE
[1] 5.504753 5.253257 5.037208
mod_fs
Linear Regression with Forward Selection
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 5.504753 0.6664465 4.156683
3 5.253257 0.6941767 3.878030
4 5.037208 0.7174721 3.759551
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 4.
p <- predict(mod_fs, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.491119
set.seed(123)
library(caret)
myControl <- trainControl(method = "cv", number = 5)
mod_be <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "leapBackward")
mod_be$results$RMSE # Training RMSE
[1] 5.419077 5.183117 4.923204
mod_be
Linear Regression with Backwards Selection
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 5.419077 0.6765806 4.051136
3 5.183117 0.7025103 3.802048
4 4.923204 0.7306559 3.663932
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 4.
p <- predict(mod_be, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.491119
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_glmnet <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "glmnet",
tuneLength = 12)
mod_glmnet
glmnet
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.1000000 0.006268536 4.531941 0.7687748 3.367907
0.1000000 0.012730881 4.531941 0.7687748 3.367907
0.1000000 0.025855371 4.531941 0.7687748 3.367907
0.1000000 0.052510131 4.532273 0.7688306 3.362414
0.1000000 0.106643752 4.534972 0.7687972 3.351165
0.1000000 0.216584681 4.545219 0.7683642 3.332474
0.1000000 0.439865657 4.581750 0.7661677 3.313576
0.1000000 0.893330938 4.672340 0.7604048 3.315023
0.1000000 1.814281591 4.881971 0.7462077 3.378451
0.1000000 3.684656549 5.273205 0.7200819 3.586951
0.1000000 7.483234110 5.829543 0.6977248 3.974399
0.1818182 0.006268536 4.532652 0.7686910 3.369804
0.1818182 0.012730881 4.532652 0.7686910 3.369804
0.1818182 0.025855371 4.532535 0.7687311 3.368753
0.1818182 0.052510131 4.533629 0.7687463 3.361378
0.1818182 0.106643752 4.538218 0.7685896 3.348988
0.1818182 0.216584681 4.553729 0.7677971 3.332598
0.1818182 0.439865657 4.601703 0.7647862 3.319390
0.1818182 0.893330938 4.724671 0.7563483 3.334967
0.1818182 1.814281591 5.004772 0.7354572 3.467215
0.1818182 3.684656549 5.429657 0.7095588 3.722030
0.1818182 7.483234110 6.083102 0.7001301 4.189152
0.2636364 0.006268536 4.532701 0.7686986 3.370252
0.2636364 0.012730881 4.532701 0.7686986 3.370252
0.2636364 0.025855371 4.532801 0.7687256 3.368260
0.2636364 0.052510131 4.534585 0.7686991 3.360225
0.2636364 0.106643752 4.540826 0.7684459 3.347808
0.2636364 0.216584681 4.563860 0.7670747 3.334517
0.2636364 0.439865657 4.623194 0.7632513 3.326660
0.2636364 0.893330938 4.795359 0.7499570 3.377120
0.2636364 1.814281591 5.100450 0.7274659 3.542196
0.2636364 3.684656549 5.532147 0.7076851 3.821693
0.2636364 7.483234110 6.344984 0.7034900 4.394777
0.3454545 0.006268536 4.532690 0.7687104 3.369953
0.3454545 0.012730881 4.532690 0.7687104 3.369953
0.3454545 0.025855371 4.533282 0.7686987 3.367527
0.3454545 0.052510131 4.535905 0.7686173 3.359008
0.3454545 0.106643752 4.544165 0.7682360 3.347396
0.3454545 0.216584681 4.573411 0.7663759 3.335494
0.3454545 0.439865657 4.649711 0.7610795 3.335982
0.3454545 0.893330938 4.865505 0.7434860 3.422438
0.3454545 1.814281591 5.198336 0.7185241 3.619982
0.3454545 3.684656549 5.638643 0.7059721 3.919628
0.3454545 7.483234110 6.637428 0.6991054 4.628281
0.4272727 0.006268536 4.532932 0.7686693 3.370761
0.4272727 0.012730881 4.532932 0.7686693 3.370761
0.4272727 0.025855371 4.533764 0.7686712 3.366759
0.4272727 0.052510131 4.537203 0.7685368 3.357960
0.4272727 0.106643752 4.548032 0.7679749 3.347680
0.4272727 0.216584681 4.582104 0.7657735 3.335922
0.4272727 0.439865657 4.682991 0.7581518 3.350579
0.4272727 0.893330938 4.923742 0.7383159 3.464337
0.4272727 1.814281591 5.282171 0.7107997 3.694325
0.4272727 3.684656549 5.742911 0.7049116 4.009762
0.4272727 7.483234110 6.948048 0.6873032 4.877146
0.5090909 0.006268536 4.533317 0.7686373 3.370564
0.5090909 0.012730881 4.533315 0.7686378 3.370544
0.5090909 0.025855371 4.534229 0.7686456 3.366047
0.5090909 0.052510131 4.538519 0.7684570 3.357281
0.5090909 0.106643752 4.552562 0.7676570 3.348736
0.5090909 0.216584681 4.592041 0.7650530 3.338071
0.5090909 0.439865657 4.723403 0.7544138 3.370384
0.5090909 0.893330938 4.964161 0.7348584 3.499293
0.5090909 1.814281591 5.327524 0.7084565 3.744212
0.5090909 3.684656549 5.859680 0.7022780 4.097255
0.5090909 7.483234110 7.236792 0.6807820 5.094855
0.5909091 0.006268536 4.533120 0.7686582 3.371010
0.5909091 0.012730881 4.533148 0.7686596 3.370795
0.5909091 0.025855371 4.534739 0.7686160 3.365512
0.5909091 0.052510131 4.539915 0.7683730 3.356933
0.5909091 0.106643752 4.557517 0.7672913 3.350136
0.5909091 0.216584681 4.603600 0.7641448 3.341380
0.5909091 0.439865657 4.765313 0.7503926 3.393861
0.5909091 0.893330938 5.008661 0.7308346 3.541779
0.5909091 1.814281591 5.368626 0.7069940 3.783998
0.5909091 3.684656549 5.991761 0.6973993 4.201063
0.5909091 7.483234110 7.552228 0.6710425 5.325813
0.6727273 0.006268536 4.533445 0.7686308 3.370991
0.6727273 0.012730881 4.533421 0.7686452 3.370394
0.6727273 0.025855371 4.535340 0.7685786 3.365051
0.6727273 0.052510131 4.541398 0.7682810 3.356976
0.6727273 0.106643752 4.562388 0.7669132 3.350888
0.6727273 0.216584681 4.617082 0.7630130 3.344737
0.6727273 0.439865657 4.802676 0.7469290 3.415936
0.6727273 0.893330938 5.057995 0.7261480 3.585362
0.6727273 1.814281591 5.411921 0.7052613 3.820470
0.6727273 3.684656549 6.139067 0.6892940 4.316349
0.6727273 7.483234110 7.868986 0.6582170 5.568722
0.7545455 0.006268536 4.533297 0.7686410 3.371024
0.7545455 0.012730881 4.533541 0.7686421 3.370278
0.7545455 0.025855371 4.536005 0.7685372 3.364639
0.7545455 0.052510131 4.543056 0.7681731 3.357271
0.7545455 0.106643752 4.566707 0.7666016 3.350634
0.7545455 0.216584681 4.632108 0.7617114 3.349562
0.7545455 0.439865657 4.838895 0.7435860 3.439463
0.7545455 0.893330938 5.110714 0.7209550 3.628710
0.7545455 1.814281591 5.456445 0.7033818 3.854451
0.7545455 3.684656549 6.278055 0.6807482 4.422164
0.7545455 7.483234110 8.168997 0.6551503 5.800493
0.8363636 0.006268536 4.533406 0.7686293 3.371177
0.8363636 0.012730881 4.533779 0.7686285 3.370047
0.8363636 0.025855371 4.536601 0.7685010 3.364246
0.8363636 0.052510131 4.544842 0.7680547 3.357559
0.8363636 0.106643752 4.570667 0.7663253 3.349922
0.8363636 0.216584681 4.649037 0.7602025 3.356487
0.8363636 0.439865657 4.868172 0.7409111 3.462787
0.8363636 0.893330938 5.159761 0.7161013 3.667872
0.8363636 1.814281591 5.505150 0.7009574 3.889771
0.8363636 3.684656549 6.405747 0.6751334 4.506194
0.8363636 7.483234110 8.502518 0.6551503 6.061084
0.9181818 0.006268536 4.533403 0.7686292 3.371655
0.9181818 0.012730881 4.533995 0.7686175 3.369815
0.9181818 0.025855371 4.537318 0.7684550 3.363864
0.9181818 0.052510131 4.546752 0.7679263 3.357880
0.9181818 0.106643752 4.575102 0.7659986 3.349782
0.9181818 0.216584681 4.668187 0.7584508 3.365246
0.9181818 0.439865657 4.888527 0.7390907 3.483374
0.9181818 0.893330938 5.200906 0.7121487 3.698759
0.9181818 1.814281591 5.559250 0.6978243 3.928229
0.9181818 3.684656549 6.541536 0.6678327 4.594709
0.9181818 7.483234110 8.888861 0.6551503 6.370009
1.0000000 0.006268536 4.533287 0.7686342 3.371616
1.0000000 0.012730881 4.534241 0.7686032 3.369501
1.0000000 0.025855371 4.538009 0.7684093 3.363463
1.0000000 0.052510131 4.548920 0.7677740 3.358274
1.0000000 0.106643752 4.579824 0.7656495 3.350216
1.0000000 0.216584681 4.689766 0.7564269 3.375471
1.0000000 0.439865657 4.907375 0.7373982 3.503123
1.0000000 0.893330938 5.233937 0.7091310 3.722923
1.0000000 1.814281591 5.619076 0.6937479 3.972824
1.0000000 3.684656549 6.674007 0.6602592 4.687782
1.0000000 7.483234110 9.331357 0.6367301 6.725990
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda = 0.02585537.
min(mod_glmnet$results$RMSE) # Training RMSE
[1] 4.531941
plot(mod_glmnet)
p <- predict(mod_glmnet, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.333308
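method = "glmnet" fits the elastic net, which penalizes the coefficients with lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta))); alpha = 0 gives ridge regression, alpha = 1 the lasso, and caret tunes both alpha and lambda above. A sketch of the penalty term itself (hypothetical coefficients, for illustration only):

```r
# Elastic-net penalty as a function of glmnet's mixing parameter alpha.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
beta <- c(0.5, -1, 2)
enet_penalty(beta, lambda = 0.1, alpha = 0)   # pure ridge: 0.2625
enet_penalty(beta, lambda = 0.1, alpha = 1)   # pure lasso: 0.35
```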
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_lasso <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(.01, 2, length = 10))
)
mod_lasso
glmnet
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.0100000 4.533669 0.7686223 3.370832
0.2311111 4.710351 0.7545360 3.384463
0.4522222 4.915221 0.7367576 3.509206
0.6733333 5.080905 0.7223991 3.628188
0.8944444 5.234474 0.7090926 3.723285
1.1155556 5.321999 0.7042096 3.786443
1.3366667 5.403792 0.7014806 3.838754
1.5577778 5.496363 0.6983004 3.895305
1.7788889 5.601179 0.6944543 3.960799
2.0000000 5.717715 0.6895725 4.046256
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01.
min(mod_lasso$results$RMSE) # Training RMSE
[1] 4.533669
plot(mod_lasso)
p <- predict(mod_lasso, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.338833
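The tuneGrid above pins alpha at 1 (the lasso) and searches ten equally spaced lambda values; expand.grid() simply builds that 10-row candidate table, which matches the lambda column in the mod_lasso output:

```r
# The lasso tuning grid: alpha fixed at 1, ten lambda candidates.
grid <- expand.grid(alpha = 1, lambda = seq(.01, 2, length = 10))
dim(grid)                    # 10 rows, 2 columns
round(grid$lambda[1:3], 4)   # 0.0100 0.2311 0.4522
```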
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_ridge <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(.01, 2, length = 10))
)
mod_ridge
glmnet
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.0100000 4.602435 0.7648794 3.299926
0.2311111 4.602435 0.7648794 3.299926
0.4522222 4.602435 0.7648794 3.299926
0.6733333 4.602435 0.7648794 3.299926
0.8944444 4.623534 0.7635154 3.300072
1.1155556 4.657768 0.7612806 3.302805
1.3366667 4.692619 0.7590021 3.306579
1.5577778 4.727545 0.7567271 3.310885
1.7788889 4.762234 0.7544767 3.317390
2.0000000 4.796520 0.7522687 3.324757
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.6733333.
min(mod_ridge$results$RMSE) # Training RMSE
[1] 4.602435
plot(mod_ridge)
p <- predict(mod_ridge, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.072336
library(rpart)
mod_tree <- rpart(medv ~ ., data = BostonTrain_pp)
mod_tree
n= 381
node), split, n, deviance, yval
* denotes terminal node
1) root 381 33992.1300 22.71365
2) rm< 0.9537213 330 14327.9400 20.14727
4) lstat>=0.4346317 134 2458.8170 15.11567
8) nox>=0.6116656 83 995.7316 13.03614
16) lstat>=0.9553247 45 324.6720 10.91333 *
17) lstat< 0.9553247 38 228.1350 15.55000 *
9) nox< 0.6116656 51 520.0200 18.50000 *
5) lstat< 0.4346317 196 6157.2980 23.58724
10) lstat>=-1.192108 178 3600.5860 22.67584
20) lstat>=-0.2367871 85 435.8631 20.72588 *
21) lstat< -0.2367871 93 2546.1260 24.45806
42) dis_rad>=-0.005776509 82 1024.9010 23.46707 *
43) dis_rad< -0.005776509 11 840.3873 31.84545 *
11) lstat< -1.192108 18 946.7200 32.60000 *
3) rm>=0.9537213 51 3427.0600 39.31961
6) rm< 1.538942 24 736.3200 33.30000 *
7) rm>=1.538942 27 1048.0560 44.67037 *
library(partykit)
plot(as.party(mod_tree))
rpart.plot::rpart.plot(mod_tree)
set.seed(123)
mod_TR <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "rpart",
tuneLength = 10)
mod_TR$bestTune
cp
1 0.002852634
mod_TR2 <- rpart(medv ~ .,
data = BostonTrain_pp,
cp = mod_TR$bestTune$cp)
rpart.plot::rpart.plot(mod_TR2)
mod_TR2
n= 381
node), split, n, deviance, yval
* denotes terminal node
1) root 381 33992.1300 22.71365
2) rm< 0.9537213 330 14327.9400 20.14727
4) lstat>=0.4346317 134 2458.8170 15.11567
8) nox>=0.6116656 83 995.7316 13.03614
16) lstat>=0.9553247 45 324.6720 10.91333 *
17) lstat< 0.9553247 38 228.1350 15.55000 *
9) nox< 0.6116656 51 520.0200 18.50000 *
5) lstat< 0.4346317 196 6157.2980 23.58724
10) lstat>=-1.192108 178 3600.5860 22.67584
20) lstat>=-0.2367871 85 435.8631 20.72588 *
21) lstat< -0.2367871 93 2546.1260 24.45806
42) dis_rad>=-0.005776509 82 1024.9010 23.46707
84) rm< -0.1881725 24 119.9983 20.35833 *
85) rm>=-0.1881725 58 576.9843 24.75345 *
43) dis_rad< -0.005776509 11 840.3873 31.84545 *
11) lstat< -1.192108 18 946.7200 32.60000 *
3) rm>=0.9537213 51 3427.0600 39.31961
6) rm< 1.538942 24 736.3200 33.30000 *
7) rm>=1.538942 27 1048.0560 44.67037
14) ptratio>=-0.4488431 7 465.9686 38.88571 *
15) ptratio< -0.4488431 20 265.8695 46.69500 *
p <- predict(mod_TR2, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 4.064799
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_tb <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "treebag"
)
mod_tb
Bagged CART
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results:
RMSE Rsquared MAE
4.248703 0.8046028 2.851989
min(mod_tb$results$RMSE) # Training RMSE
[1] 4.248703
p <- predict(mod_tb, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 3.402161
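method = "treebag" bags regression trees: fit one tree per bootstrap resample of the training data and average the predictions. The aggregation idea in miniature, with the sample mean standing in for the tree (toy data, for illustration only):

```r
# Bootstrap aggregation: B resamples, one trivial "model" (the mean)
# per resample, predictions averaged across resamples.
set.seed(1)
y <- c(10, 12, 9, 30, 11)
boot_means <- replicate(1000, mean(sample(y, replace = TRUE)))
mean(boot_means)   # aggregated estimate, close to mean(y) = 14.4
```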
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_rf <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "ranger",
tuneLength = 12)
mod_rf
Random Forest
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 3.882660 0.8549853 2.582353
2 extratrees 4.393362 0.8269696 2.883258
3 variance 3.604615 0.8706040 2.398559
3 extratrees 3.959862 0.8555807 2.592945
4 variance 3.483330 0.8757024 2.315875
4 extratrees 3.749063 0.8674375 2.478235
5 variance 3.462847 0.8747808 2.276142
5 extratrees 3.646908 0.8710179 2.419274
6 variance 3.410294 0.8770048 2.277726
6 extratrees 3.537758 0.8771085 2.359015
7 variance 3.377094 0.8779980 2.263806
7 extratrees 3.487552 0.8791181 2.339238
8 variance 3.362000 0.8786081 2.275450
8 extratrees 3.477389 0.8778017 2.333909
9 variance 3.374294 0.8767657 2.287334
9 extratrees 3.427999 0.8801817 2.318521
10 variance 3.325169 0.8795108 2.251780
10 extratrees 3.388989 0.8823772 2.309936
11 variance 3.386182 0.8742704 2.301126
11 extratrees 3.398286 0.8807329 2.298679
12 variance 3.361578 0.8759020 2.283905
12 extratrees 3.431754 0.8771214 2.321189
14 variance 3.404402 0.8726414 2.318482
14 extratrees 3.415521 0.8772402 2.314968
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 10, splitrule = variance
and min.node.size = 5.
min(mod_rf$results$RMSE) # Training RMSE
[1] 3.325169
plot(mod_rf)
p <- predict(mod_rf, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 2.840147
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_gbm <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "gbm",
tuneLength = 20,
verbose = FALSE)
mod_gbm
Stochastic Gradient Boosting
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
interaction.depth n.trees RMSE Rsquared MAE
1 50 4.428773 0.7888333 3.006482
1 100 4.047960 0.8189926 2.739266
1 150 3.855028 0.8366143 2.639787
1 200 3.743967 0.8444369 2.581693
1 250 3.702706 0.8483022 2.563787
1 300 3.668241 0.8503160 2.539931
1 350 3.655883 0.8508061 2.558279
1 400 3.670828 0.8502705 2.592232
1 450 3.655811 0.8510798 2.565305
1 500 3.675348 0.8495578 2.583326
1 550 3.674615 0.8496574 2.563767
1 600 3.669312 0.8498401 2.573937
1 650 3.645863 0.8511840 2.568466
1 700 3.648841 0.8514472 2.569608
1 750 3.650813 0.8514699 2.569149
1 800 3.650749 0.8515349 2.565597
1 850 3.662456 0.8506679 2.573882
1 900 3.669093 0.8496412 2.590562
1 950 3.668496 0.8502170 2.604685
1 1000 3.675913 0.8492534 2.603852
2 50 3.938788 0.8301776 2.640324
2 100 3.683802 0.8495019 2.499711
2 150 3.585608 0.8576876 2.457644
2 200 3.509749 0.8631718 2.426119
2 250 3.509944 0.8629275 2.407857
2 300 3.493485 0.8637898 2.383128
2 350 3.489513 0.8643085 2.380558
2 400 3.486865 0.8644424 2.386451
2 450 3.495397 0.8640810 2.381447
2 500 3.490253 0.8648950 2.388229
2 550 3.505854 0.8632834 2.394115
2 600 3.491335 0.8651782 2.373485
2 650 3.504520 0.8642002 2.390197
2 700 3.506268 0.8639879 2.396107
2 750 3.491195 0.8653516 2.386409
2 800 3.496766 0.8647312 2.390178
2 850 3.506912 0.8640021 2.404657
2 900 3.493739 0.8651225 2.402403
2 950 3.512886 0.8637512 2.411719
2 1000 3.498261 0.8645953 2.416654
3 50 3.681611 0.8505916 2.504868
3 100 3.499239 0.8629898 2.401405
3 150 3.442695 0.8678567 2.362070
3 200 3.407676 0.8692674 2.349334
3 250 3.413047 0.8697151 2.365462
3 300 3.380727 0.8720552 2.350278
3 350 3.394959 0.8710429 2.361554
3 400 3.397460 0.8708195 2.368077
3 450 3.418578 0.8699498 2.383716
3 500 3.410687 0.8705026 2.387100
3 550 3.398236 0.8717413 2.371035
3 600 3.398235 0.8714174 2.384537
3 650 3.398247 0.8715768 2.371389
3 700 3.395292 0.8718434 2.368661
3 750 3.413495 0.8705949 2.371142
3 800 3.416559 0.8702603 2.366692
3 850 3.416537 0.8703125 2.365698
3 900 3.425045 0.8696274 2.371548
3 950 3.432243 0.8690331 2.377062
3 1000 3.439126 0.8683348 2.376101
4 50 3.630279 0.8521719 2.358107
4 100 3.387673 0.8705752 2.278085
4 150 3.367977 0.8715175 2.297628
4 200 3.391339 0.8698552 2.315336
4 250 3.397939 0.8691872 2.324676
4 300 3.414306 0.8684292 2.335287
4 350 3.428006 0.8674090 2.351701
4 400 3.414278 0.8681991 2.342690
4 450 3.422546 0.8680037 2.339973
4 500 3.435872 0.8668872 2.352222
4 550 3.457865 0.8652164 2.366932
4 600 3.459655 0.8650159 2.373913
4 650 3.471678 0.8643243 2.382173
4 700 3.460265 0.8652149 2.382589
4 750 3.457638 0.8653354 2.377797
4 800 3.466524 0.8645041 2.387685
4 850 3.475391 0.8638673 2.394403
4 900 3.473758 0.8640069 2.395428
4 950 3.475577 0.8639833 2.399252
4 1000 3.480137 0.8635879 2.396478
5 50 3.634573 0.8536924 2.424409
5 100 3.441891 0.8680166 2.357887
5 150 3.438558 0.8687020 2.348839
5 200 3.447921 0.8679210 2.357030
5 250 3.445761 0.8678685 2.351814
5 300 3.461938 0.8668119 2.356273
5 350 3.474068 0.8663297 2.366627
5 400 3.472736 0.8663673 2.371855
5 450 3.484163 0.8658671 2.386352
5 500 3.479013 0.8659639 2.388138
5 550 3.487976 0.8652488 2.396109
5 600 3.483940 0.8655013 2.395897
5 650 3.482839 0.8654450 2.396242
5 700 3.486845 0.8652199 2.399072
5 750 3.489562 0.8649952 2.399393
5 800 3.481520 0.8656574 2.394991
5 850 3.483884 0.8653694 2.397931
5 900 3.486374 0.8651897 2.401086
5 950 3.491622 0.8648190 2.404430
5 1000 3.489109 0.8649509 2.401504
6 50 3.545494 0.8634129 2.387633
6 100 3.370738 0.8754375 2.270499
6 150 3.324218 0.8788253 2.242401
6 200 3.325099 0.8785586 2.248353
6 250 3.310366 0.8787499 2.229646
6 300 3.308517 0.8785813 2.226136
6 350 3.314117 0.8783089 2.231768
6 400 3.335421 0.8768952 2.246106
6 450 3.341025 0.8765118 2.253025
6 500 3.342478 0.8763059 2.259020
6 550 3.344140 0.8762680 2.263618
6 600 3.349776 0.8758517 2.263980
6 650 3.349604 0.8758837 2.270124
6 700 3.354695 0.8754727 2.271763
6 750 3.352293 0.8757367 2.270943
6 800 3.349832 0.8758928 2.271082
6 850 3.351123 0.8757741 2.270993
6 900 3.355586 0.8754891 2.274472
6 950 3.356418 0.8753894 2.277727
6 1000 3.356038 0.8753869 2.278907
7 50 3.610302 0.8565545 2.427758
7 100 3.419132 0.8694546 2.311198
7 150 3.348196 0.8738664 2.296725
7 200 3.350426 0.8742099 2.298975
7 250 3.361338 0.8734861 2.296747
7 300 3.354052 0.8736766 2.291524
7 350 3.356410 0.8732836 2.300522
7 400 3.360607 0.8730361 2.314271
7 450 3.378984 0.8718359 2.326552
7 500 3.371711 0.8722117 2.329695
7 550 3.381585 0.8714832 2.332530
7 600 3.385973 0.8710832 2.337464
7 650 3.392750 0.8705550 2.344415
7 700 3.396558 0.8702669 2.345666
7 750 3.400623 0.8699106 2.349314
7 800 3.397624 0.8701334 2.347351
7 850 3.397013 0.8702441 2.346639
7 900 3.398754 0.8701594 2.348979
7 950 3.396872 0.8703132 2.347884
7 1000 3.396852 0.8703013 2.349981
8 50 3.573368 0.8591485 2.358180
8 100 3.372672 0.8728824 2.217515
8 150 3.306652 0.8783508 2.214576
8 200 3.286337 0.8801453 2.204874
8 250 3.315634 0.8781739 2.212213
8 300 3.328194 0.8771627 2.225594
8 350 3.346730 0.8758373 2.232977
8 400 3.365483 0.8746302 2.239855
8 450 3.366661 0.8744888 2.241845
8 500 3.373131 0.8741160 2.246985
8 550 3.376471 0.8739811 2.248639
8 600 3.379604 0.8736706 2.255808
8 650 3.388022 0.8730291 2.261035
8 700 3.393164 0.8727154 2.264028
8 750 3.396171 0.8725472 2.265877
8 800 3.393228 0.8727890 2.265147
8 850 3.399940 0.8722822 2.266808
8 900 3.399878 0.8722923 2.269161
8 950 3.399456 0.8722807 2.270235
8 1000 3.398889 0.8723798 2.271384
9 50 3.606494 0.8555089 2.358468
9 100 3.366193 0.8732969 2.254176
9 150 3.343975 0.8748846 2.238050
9 200 3.374991 0.8726664 2.261978
9 250 3.383447 0.8717882 2.274104
9 300 3.391612 0.8713235 2.287463
9 350 3.416703 0.8694872 2.300250
9 400 3.413864 0.8698125 2.297301
9 450 3.419990 0.8692303 2.299995
9 500 3.415345 0.8697677 2.302730
9 550 3.421009 0.8693516 2.309552
9 600 3.426426 0.8689714 2.317989
9 650 3.421788 0.8693747 2.318333
9 700 3.432583 0.8686095 2.323436
9 750 3.437195 0.8682444 2.326529
9 800 3.433855 0.8685757 2.325045
9 850 3.433749 0.8686167 2.325484
9 900 3.437182 0.8683395 2.329172
9 950 3.441664 0.8679887 2.331806
9 1000 3.442840 0.8679349 2.332901
10 50 3.539946 0.8607481 2.352134
10 100 3.429792 0.8688410 2.300477
10 150 3.426095 0.8692708 2.322954
10 200 3.448365 0.8677114 2.334312
10 250 3.449458 0.8679522 2.346705
10 300 3.461077 0.8671903 2.365094
10 350 3.465022 0.8671389 2.362883
10 400 3.473098 0.8665413 2.364261
10 450 3.481201 0.8660166 2.375820
10 500 3.485639 0.8656491 2.376991
10 550 3.488041 0.8655458 2.376132
10 600 3.483644 0.8658596 2.378910
10 650 3.484351 0.8658747 2.379831
10 700 3.481911 0.8660159 2.378319
10 750 3.480581 0.8661290 2.378451
10 800 3.482129 0.8660043 2.381396
10 850 3.482690 0.8659659 2.384267
10 900 3.482570 0.8659827 2.386823
10 950 3.479670 0.8662218 2.384324
10 1000 3.479298 0.8662540 2.383978
11 50 3.501531 0.8621977 2.307540
11 100 3.359005 0.8720367 2.255849
11 150 3.346383 0.8733969 2.246381
11 200 3.333712 0.8744968 2.242532
11 250 3.352954 0.8732890 2.252668
11 300 3.388746 0.8705842 2.274085
11 350 3.393699 0.8703745 2.296223
11 400 3.388404 0.8706985 2.290694
11 450 3.406562 0.8693590 2.302527
11 500 3.402903 0.8697030 2.300781
11 550 3.398336 0.8701082 2.299115
11 600 3.395695 0.8702942 2.299859
11 650 3.404033 0.8696101 2.311132
11 700 3.399384 0.8699225 2.309540
11 750 3.400772 0.8697921 2.307757
11 800 3.402471 0.8696840 2.306131
11 850 3.404462 0.8695746 2.308169
11 900 3.400254 0.8698607 2.306035
11 950 3.400237 0.8698914 2.305679
11 1000 3.402585 0.8696643 2.307642
12 50 3.503650 0.8657952 2.343189
12 100 3.351280 0.8747690 2.257319
12 150 3.329858 0.8761009 2.247890
12 200 3.310827 0.8776089 2.225117
12 250 3.328143 0.8767307 2.233100
12 300 3.350426 0.8747676 2.245762
12 350 3.360104 0.8740504 2.249205
12 400 3.362182 0.8738319 2.255678
12 450 3.368256 0.8734343 2.264143
12 500 3.370660 0.8733203 2.270107
12 550 3.366126 0.8736205 2.265129
12 600 3.370909 0.8732466 2.269578
12 650 3.366734 0.8735692 2.274519
12 700 3.372165 0.8731573 2.276052
12 750 3.374747 0.8729924 2.279608
12 800 3.376639 0.8728835 2.283185
12 850 3.381213 0.8725750 2.283874
12 900 3.383216 0.8723828 2.283489
12 950 3.381552 0.8725642 2.282673
12 1000 3.380705 0.8726678 2.282174
13 50 3.540355 0.8615863 2.348150
13 100 3.381958 0.8732446 2.264972
13 150 3.386752 0.8728850 2.279050
13 200 3.354424 0.8748000 2.268177
13 250 3.348106 0.8751906 2.258045
13 300 3.387972 0.8725958 2.288311
13 350 3.403463 0.8715890 2.300869
13 400 3.409606 0.8713727 2.309603
13 450 3.407427 0.8714280 2.314796
13 500 3.414741 0.8709659 2.322057
13 550 3.420000 0.8707130 2.326621
13 600 3.421876 0.8705719 2.328771
13 650 3.425400 0.8702796 2.327997
13 700 3.427226 0.8702033 2.333441
13 750 3.428672 0.8700669 2.334847
13 800 3.432100 0.8698341 2.339001
13 850 3.433166 0.8697565 2.339785
13 900 3.435689 0.8695499 2.340632
13 950 3.431578 0.8697852 2.338391
13 1000 3.431521 0.8697667 2.338566
14 50 3.487475 0.8665858 2.346312
14 100 3.357082 0.8755412 2.258245
14 150 3.318401 0.8774349 2.240727
14 200 3.334607 0.8767042 2.263036
14 250 3.350345 0.8755530 2.273483
14 300 3.330249 0.8770658 2.273210
14 350 3.331501 0.8770855 2.274607
14 400 3.340525 0.8765946 2.284378
14 450 3.355156 0.8756431 2.292626
14 500 3.351840 0.8759516 2.296480
14 550 3.358708 0.8754984 2.299862
14 600 3.363976 0.8752819 2.302121
14 650 3.364503 0.8752572 2.304893
14 700 3.361423 0.8754347 2.304316
14 750 3.362935 0.8753596 2.307098
14 800 3.361838 0.8754581 2.305867
14 850 3.364471 0.8753280 2.310039
14 900 3.363132 0.8754191 2.311363
14 950 3.367094 0.8751545 2.313646
14 1000 3.366067 0.8752324 2.313678
15 50 3.513247 0.8633846 2.306771
15 100 3.457100 0.8662735 2.314283
15 150 3.416195 0.8699861 2.298366
15 200 3.407215 0.8706274 2.308065
15 250 3.417615 0.8698830 2.303766
15 300 3.409241 0.8707794 2.305799
15 350 3.410812 0.8708272 2.316896
15 400 3.419339 0.8701744 2.316402
15 450 3.432460 0.8694065 2.335292
15 500 3.424402 0.8698869 2.334830
15 550 3.422062 0.8700898 2.334798
15 600 3.423308 0.8700188 2.334501
15 650 3.428561 0.8696345 2.339787
15 700 3.430554 0.8693789 2.342990
15 750 3.431555 0.8693589 2.344961
15 800 3.436054 0.8690400 2.349257
15 850 3.433501 0.8692226 2.349622
15 900 3.434136 0.8691595 2.350384
15 950 3.436853 0.8689837 2.351461
15 1000 3.439417 0.8688168 2.353364
16 50 3.454917 0.8682484 2.343961
16 100 3.352725 0.8744573 2.357352
16 150 3.327503 0.8766737 2.326607
16 200 3.347200 0.8747500 2.333203
16 250 3.343790 0.8751199 2.331704
16 300 3.318405 0.8770002 2.329304
16 350 3.340877 0.8758698 2.333040
16 400 3.323703 0.8768449 2.326908
16 450 3.324944 0.8769416 2.330680
16 500 3.336631 0.8761511 2.337876
16 550 3.340343 0.8758326 2.339429
16 600 3.340256 0.8758981 2.340852
16 650 3.344837 0.8757028 2.345030
16 700 3.346342 0.8755146 2.347102
16 750 3.350777 0.8752823 2.349445
16 800 3.344177 0.8757008 2.345713
16 850 3.350885 0.8753309 2.348760
16 900 3.348429 0.8755368 2.346879
16 950 3.346667 0.8756308 2.344740
16 1000 3.346819 0.8756549 2.343974
17 50 3.467422 0.8680294 2.266348
17 100 3.269976 0.8800315 2.173429
17 150 3.262827 0.8799490 2.181185
17 200 3.270534 0.8801571 2.190869
17 250 3.263116 0.8810625 2.177968
17 300 3.265723 0.8808335 2.188209
17 350 3.274203 0.8805054 2.198478
17 400 3.280507 0.8801670 2.201575
17 450 3.280923 0.8801547 2.206459
17 500 3.289309 0.8796305 2.209300
17 550 3.298506 0.8790727 2.217084
17 600 3.299508 0.8789239 2.220405
17 650 3.303117 0.8786049 2.222215
17 700 3.303700 0.8786413 2.226727
17 750 3.301280 0.8788538 2.225466
17 800 3.308173 0.8783879 2.234396
17 850 3.304730 0.8786961 2.234176
17 900 3.304334 0.8787180 2.235214
17 950 3.304763 0.8787133 2.235338
17 1000 3.308531 0.8784199 2.237257
18 50 3.516922 0.8623627 2.306210
18 100 3.341255 0.8762368 2.218734
18 150 3.338977 0.8748038 2.209525
18 200 3.338084 0.8747692 2.199398
18 250 3.381760 0.8721145 2.242097
18 300 3.363596 0.8735106 2.237524
18 350 3.382437 0.8723787 2.242000
18 400 3.378309 0.8728203 2.249490
18 450 3.374048 0.8731747 2.257286
18 500 3.392557 0.8720082 2.272674
18 550 3.400602 0.8716078 2.280535
18 600 3.404384 0.8712375 2.286502
18 650 3.406092 0.8710642 2.290191
18 700 3.408676 0.8708748 2.288685
18 750 3.403202 0.8713174 2.288093
18 800 3.402655 0.8714293 2.290021
18 850 3.398262 0.8717300 2.289369
18 900 3.401136 0.8715026 2.291645
18 950 3.397890 0.8718426 2.292179
18 1000 3.398516 0.8717779 2.291726
19 50 3.454601 0.8678595 2.303757
19 100 3.289994 0.8790099 2.203701
19 150 3.199010 0.8853453 2.171613
19 200 3.230360 0.8833461 2.173840
19 250 3.225462 0.8839761 2.171680
19 300 3.242321 0.8827249 2.178389
19 350 3.249192 0.8826446 2.185557
19 400 3.246966 0.8829120 2.185755
19 450 3.260617 0.8819685 2.193676
19 500 3.256427 0.8821794 2.194566
19 550 3.256032 0.8822996 2.199529
19 600 3.259094 0.8821025 2.200891
19 650 3.264685 0.8817269 2.204813
19 700 3.266361 0.8815717 2.206359
19 750 3.269810 0.8813900 2.209752
19 800 3.269079 0.8813265 2.208971
19 850 3.269908 0.8813409 2.209206
19 900 3.266108 0.8816276 2.207078
19 950 3.267229 0.8816112 2.208378
19 1000 3.265703 0.8816533 2.209534
20 50 3.472948 0.8662695 2.310621
20 100 3.353499 0.8735781 2.260434
20 150 3.311872 0.8769038 2.251101
20 200 3.313604 0.8769059 2.252280
20 250 3.312326 0.8770281 2.253337
20 300 3.319448 0.8764406 2.252794
20 350 3.310290 0.8772701 2.239427
20 400 3.321553 0.8765927 2.248384
20 450 3.327350 0.8762556 2.253859
20 500 3.333630 0.8759037 2.260695
20 550 3.342153 0.8752460 2.269272
20 600 3.343967 0.8751386 2.271281
20 650 3.344505 0.8751967 2.273031
20 700 3.336746 0.8756960 2.272267
20 750 3.339824 0.8755332 2.273968
20 800 3.342642 0.8753670 2.275156
20 850 3.341782 0.8753984 2.273987
20 900 3.339913 0.8755337 2.273824
20 950 3.339133 0.8755608 2.273518
20 1000 3.339994 0.8755231 2.272497
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
19, shrinkage = 0.1 and n.minobsinnode = 10.
min(mod_gbm$results$RMSE) # Training RMSE
[1] 3.19901
plot(mod_gbm)
p <- predict(mod_gbm, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 2.838095
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_svm <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "svmRadial",
tuneLength = 12)
mod_svm
Support Vector Machines with Radial Basis Function Kernel
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 5.145241 0.7469286 3.015047
0.50 4.621223 0.7864458 2.741994
1.00 4.121562 0.8243700 2.553312
2.00 3.707207 0.8534623 2.378429
4.00 3.590632 0.8577854 2.306068
8.00 3.536759 0.8618186 2.288472
16.00 3.516954 0.8619748 2.323446
32.00 3.567969 0.8559992 2.396587
64.00 3.724730 0.8423458 2.516428
128.00 3.844120 0.8328852 2.574771
256.00 4.025249 0.8186351 2.645530
512.00 4.097624 0.8121341 2.665030
Tuning parameter 'sigma' was held constant at a value of 0.08676922
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.08676922 and C = 16.
min(mod_svm$results$RMSE) # Training RMSE
[1] 3.516954
plot(mod_svm)
p <- predict(mod_svm, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 2.493898
set.seed(123)
myControl <- trainControl(method = "cv", number = 5)
mod_knn <- train(medv ~ .,
data = BostonTrain_pp,
trControl = myControl,
method = "knn",
tuneLength = 12)
mod_knn
k-Nearest Neighbors
381 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 306, 304, 304, 305, 305
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 5.121897 0.7114327 3.211232
7 5.008402 0.7344395 3.151333
9 5.129724 0.7269107 3.250860
11 5.209405 0.7211939 3.281465
13 5.223066 0.7259728 3.276010
15 5.324880 0.7180671 3.306677
17 5.432628 0.7068527 3.379118
19 5.558463 0.6958238 3.465401
21 5.632407 0.6911103 3.501519
23 5.713394 0.6870887 3.561730
25 5.795456 0.6821562 3.603982
27 5.863948 0.6778956 3.659176
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 7.
min(mod_knn$results$RMSE) # Training RMSE
[1] 5.008402
plot(mod_knn)
p <- predict(mod_knn, newdata = BostonTest_pp)
RMSE(BostonTest_pp$medv, p) # Test RMSE
[1] 3.499724
Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode response value for the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods. Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy. Hence we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.
bfc <- read.csv("https://raw.githubusercontent.com/STAT-ATA-ASU/PredictiveModelBuilding/master/bodyfatClean.csv?token=AAO56AE3XCZANCK4V2FPH2LBTWG4I")
head(bfc, n = 3)
age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3
2 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3
3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9
ankle_cm biceps_cm forearm_cm wrist_cm brozek_C bmi_C age_sq abdomen_wrist
1 21.9 32.0 27.4 17.1 12.6 23.6 529 68.1
2 23.4 30.5 28.9 18.2 6.9 23.3 484 64.8
3 24.0 28.8 25.2 16.6 24.6 24.7 484 71.3
am
1 181.9365
2 169.1583
3 195.5067
# Define an average person per the data
avgperson <- data.frame(age = 45, weight_lbs = 180, height_in = 70, neck_cm = 38,
chest_cm = 101, abdomen_cm = 93, hip_cm = 100, thigh_cm = 59, knee_cm = 39,
ankle_cm = 23, biceps_cm = 32, forearm_cm = 29, wrist_cm = 18, bmi_C = 25,
age_sq = 2175, abdomen_wrist = 74, am = 193)
avgperson
age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1 45 180 70 38 101 93 100 59 39
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist am
1 23 32 29 18 25 2175 74 193
library(rpart)
tree <- rpart(brozek_C ~ weight_lbs + abdomen_cm, data = bfc)
tree
n= 251
node), split, n, deviance, yval
* denotes terminal node
1) root 251 14717.0800 19.00677
2) abdomen_cm< 91.9 131 3830.4570 13.90458
4) abdomen_cm< 85.45 65 1019.9860 10.67385 *
5) abdomen_cm>=85.45 66 1463.8580 17.08636
10) weight_lbs>=173.625 23 356.9748 13.96087 *
11) weight_lbs< 173.625 43 762.0247 18.75814 *
3) abdomen_cm>=91.9 120 3753.5350 24.57667
6) abdomen_cm< 103 81 1496.7200 22.28889 *
7) abdomen_cm>=103 39 952.3590 29.32821
14) abdomen_cm< 112.3 28 353.9671 27.37857 *
15) abdomen_cm>=112.3 11 221.0491 34.29091 *
library(rattle)
fancyRpartPlot(tree)
Figure 3.1: Draw the regions for this tree in class
predict(tree, newdata = avgperson)
1
22.28889
We now discuss the process of building a regression tree. Roughly speaking, there are two steps.
We divide the predictor space — that is, the set of possible values for \(X_1, X_2,\ldots, X_p\) — into \(J\) distinct and non-overlapping regions, \(R_1, R_2,\ldots,R_J\).
For every observation that falls into the region \(R_j\) , we make the same prediction, which is simply the mean of the response values for the training observations in \(R_j\).
For instance, suppose that in Step 1 we obtain two regions, \(R_1\) and \(R_2\), and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation \(X = x\), if \(x \in R_1\) we will predict a value of 10, and if \(x \in R_2\) we will predict a value of 20.
We now elaborate on Step 1 above. How do we construct the regions \(R_1,\ldots,R_J\)? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes \(R_1,\ldots,R_J\) that minimize the RSS, given by
\[\sum_{j=1}^J\sum_{i\in R_j}(y_i - \hat{y}_{R_j})^2,\] where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j^{\text{th}}\) box. Unfortunately, it is not computationally feasible to consider every possible partition of the feature space into \(J\) boxes. For this reason, we take a top-down, greedy approach known as recursive binary splitting. The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
In order to perform recursive binary splitting, we first select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X|X_j < s\}\) and \(\{X|X_j \geq s\}\) leads to the greatest possible reduction in RSS. That is, we consider all predictors \(X_1, \ldots, X_p\), and all possible values of the cutpoint \(s\) for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In greater detail, for any \(j\) and \(s\), we define the pair of half-planes
\[R_1(j,s)=\{X|X_j < s\} \quad \text{and} \quad R_2(j,s) = \{X|X_j \geq s\},\] and we seek the values of \(j\) and \(s\) that minimize
\[\sum_{i:x_i\in R_1(j,s)}(y_i - \hat{y}_{R_1})^2 + \sum_{i:x_i\in R_2(j,s)}(y_i - \hat{y}_{R_2})^2,\] where \(\hat{y}_{R_1}\) is the mean response for the training observations in \(R_1(j,s)\), and \(\hat{y}_{R_2}\) is the mean response for the training observations in \(R_2(j,s)\).
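To make the greedy search concrete, here is a minimal base-R sketch of a single split (illustrative only, not the `rpart` implementation): for each predictor \(j\) and each observed cutpoint \(s\), compute the two-region RSS and keep the pair that minimizes it.

```r
# Greedy search for one binary split: try every predictor j and every
# observed value s as a cutpoint, and keep the (j, s) pair that gives
# the smallest two-region RSS.
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left  <- y[X[, j] <  s]
      right <- y[X[, j] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best
}
# Toy data: y jumps when x1 reaches 5, so the best cut is at x1 = 5
set.seed(1)
X <- cbind(x1 = 1:10, x2 = rnorm(10))
y <- c(rep(0, 4), rep(10, 6))
best_split(X, y)  # j = 1, s = 5, rss = 0
```

`rpart` performs this same search far more efficiently (and recursively, within each new region) but the idea is exactly the one above.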
Once the regions \(R_1, \ldots, R_J\) have been created, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
Draw the regions for the tree shown in Figure 3.2.
ptree <- rpart(brozek_C ~ weight_lbs + height_in, data = bfc)
fancyRpartPlot(ptree)
Figure 3.2: Use this tree representation to draw the Regions on paper
The decision trees discussed in the previous section suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance if the ratio of \(n\) to \(p\) is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
Recall that given a set of \(n\) independent observations \(Z_1, Z_2, \ldots, Z_n\), each with variance \(\sigma^2\), the variance of the mean \(\bar{Z}\) of the observations is given by \(\sigma^2/n\). In other words, averaging a set of observations reduces variance. Hence a natural way to reduce the variance and increase the test set accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. In other words, we could calculate \(\hat{f}^1(x), \hat{f}^2(x),\ldots,\hat{f}^B(x)\) using \(B\) separate training sets, and average them in order to obtain a single low-variance statistical learning model, given by
\[\hat{f}_{\text{avg}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^b(x).\] Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, taking repeated samples from the (single) training data set. In this approach we generate \(B\) different bootstrapped training data sets. We then train our method on the \(b^{\text{th}}\) bootstrapped training set in order to get \(\hat{f}^{*b}(x)\), and finally average all the predictions to obtain
\[\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x).\] In bagging, the trees are grown deep and are not pruned. Hence each individual tree has high variance but low bias. Averaging these \(B\) trees reduces the variance!
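The variance-reduction claim is easy to check by simulation. This toy snippet (base R, not the book's code) compares the spread of a single noisy quantity with the spread of an average of \(B\) independent copies:

```r
# Averaging B independent quantities, each with variance sigma^2 = 1,
# yields a mean with variance sigma^2 / B.  Simulated check with B = 100:
set.seed(42)
B <- 100
single <- rnorm(10000)                       # 10,000 draws of one quantity
avg_B  <- replicate(10000, mean(rnorm(B)))   # 10,000 averages of B draws
var(single)  # close to 1
var(avg_B)   # close to 1/B = 0.01
```

Bagged trees play the role of the `rnorm(B)` draws here: each one is noisy, but their average is far more stable.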
set.seed(34)
bs1 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs2 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs3 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs4 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs5 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs6 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs7 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs8 <- bfc[sample(1:251, 251, replace = TRUE), ]
bs9 <- bfc[sample(1:251, 251, replace = TRUE), ]
library(rpart)
library(rattle) # students will have to install rattle
avgperson
age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1 45 180 70 38 101 93 100 59 39
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C age_sq abdomen_wrist am
1 23 32 29 18 25 2175 74 193
# Create a tree to predict brozek_C
tree1 <- rpart(brozek_C ~.,
data = bs1)
rpart.plot::rpart.plot(tree1)
fancyRpartPlot(tree1)
#
(predict(tree1, newdata = avgperson) -> pt1)
1
20.47333
# 20.4733
tree2 <- rpart(brozek_C ~.,
data = bs2)
rpart.plot::rpart.plot(tree2)
fancyRpartPlot(tree2)
#
(predict(tree2, newdata = avgperson) -> pt2)
1
24.19846
# 24.19846
tree3 <- rpart(brozek_C ~.,
data = bs3)
rpart.plot::rpart.plot(tree3)
fancyRpartPlot(tree3)
#
(predict(tree3, newdata = avgperson) -> pt3)
1
19.9902
# 19.9902
tree4 <- rpart(brozek_C ~.,
data = bs4)
rpart.plot::rpart.plot(tree4)
fancyRpartPlot(tree4)
#
(predict(tree4, newdata = avgperson) -> pt4)
1
20.71739
# 20.71739
tree5 <- rpart(brozek_C ~.,
data = bs5)
rpart.plot::rpart.plot(tree5)
fancyRpartPlot(tree5)
#
(predict(tree5, newdata = avgperson) -> pt5)
1
22.10137
# 22.10137
tree6 <- rpart(brozek_C ~.,
data = bs6)
rpart.plot::rpart.plot(tree6)
fancyRpartPlot(tree6)
#
(predict(tree6, newdata = avgperson) -> pt6)
1
20.296
# 20.296
tree7 <- rpart(brozek_C ~.,
data = bs7)
rpart.plot::rpart.plot(tree7)
fancyRpartPlot(tree7)
#
(predict(tree7, newdata = avgperson) -> pt7)
1
19.73148
# 19.73148
tree8 <- rpart(brozek_C ~.,
data = bs8)
rpart.plot::rpart.plot(tree8)
fancyRpartPlot(tree8)
#
(predict(tree8, newdata = avgperson) -> pt8)
1
23.21591
# 23.21591
tree9 <- rpart(brozek_C ~.,
data = bs9)
rpart.plot::rpart.plot(tree9)
fancyRpartPlot(tree9)
#
(predict(tree9, newdata = avgperson) -> pt9)
1
23.16
# 23.16
mean(c(pt1, pt2, pt3, pt4, pt5, pt6, pt7, pt8, pt9))
[1] 21.54268
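The nine hand-rolled bootstrap fits above can be collapsed into a loop. The sketch below uses the built-in `trees` data set so that it is self-contained; substituting `bfc`, `brozek_C ~ .`, and `avgperson` reproduces the workflow above (though not necessarily the exact value, since `rpart`'s internal cross-validation also consumes random numbers, shifting the bootstrap draws).

```r
library(rpart)
# Bag B trees by hand: resample the rows, fit a tree to each bootstrap
# sample, predict for the new observation, then average the predictions.
# The built-in `trees` data (Girth, Height, Volume; n = 31) stands in
# for bfc here; `newpt` is a hypothetical new tree to predict for.
set.seed(34)
B <- 9
n <- nrow(trees)
newpt <- data.frame(Girth = 13, Height = 76)
bag_preds <- sapply(seq_len(B), function(b) {
  bs  <- trees[sample(n, n, replace = TRUE), ]
  fit <- rpart(Volume ~ Girth + Height, data = bs, minsplit = 5)
  predict(fit, newdata = newpt)
})
mean(bag_preds)  # the bagged prediction
```

`minsplit = 5` just lets `rpart` split this small data set; with a data set the size of `bfc` the defaults are fine.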
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors. A fresh sample of \(m\) predictors is taken at each split, and typically we choose \(m \approx \sqrt{p}\)—that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.
In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. This may sound crazy, but it has a clever rationale. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.
Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average \((p-m)/p\) of the splits will not even consider the strong predictor, and so the other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
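The benefit of decorrelation can be quantified. For \(B\) identically distributed quantities with variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is \(\rho\sigma^2 + (1-\rho)\sigma^2/B\) (a standard identity, not from this text). The \(\rho\sigma^2\) term does not shrink as \(B\) grows, so high correlation puts a floor under what averaging can achieve:

```r
# Variance of an average of B correlated quantities:
#   rho * sigma^2 + (1 - rho) * sigma^2 / B
# The rho * sigma^2 floor does not shrink with B, which is why
# decorrelating the trees (small rho) lets a large forest keep
# driving the variance down.
avg_var <- function(rho, B, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
avg_var(rho = 0.9, B = 500)  # 0.9002, correlated bagged trees
avg_var(rho = 0.1, B = 500)  # 0.1018, decorrelated random-forest trees
```

The illustrative \(\rho\) values here are made up; the point is only that growing more trees cannot compensate for correlation among them, while sampling \(m\) of \(p\) predictors at each split attacks the correlation directly.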